Metric Learning
Yuchi Huang, Member, IEEE, Deyu Meng, and Lei Zhang, Senior Member, IEEE


Abstract —Distance metric learning aims to learn from the given training data a valid distance metric, with which the similarity between 
data samples can be more effectively evaluated for classification. Metric learning is often formulated as a convex or nonconvex 
optimization problem, while many existing metric learning algorithms become inefficient for large scale problems. In this paper, we 
formulate metric learning as a kernel classification problem, and solve it by iterated training of support vector machines (SVM). The 
new formulation is easy to implement, efficient in training, and tractable for large-scale problems. Two novel metric learning models, 
namely Positive-semidefinite Constrained Metric Learning (PCML) and Nonnegative-coefficient Constrained Metric Learning (NCML), 
are developed. Both PCML and NCML can guarantee the global optimality of their solutions. Experimental results on UCI dataset 
classification, handwritten digit recognition, face verification and person re-identification demonstrate that the proposed metric learning 
methods achieve higher classification accuracy than state-of-the-art methods and they are significantly more efficient in training. 

Index Terms —metric learning, support vector machine, kernel method, Lagrange duality, alternative optimization 
1 Introduction

Distance metric learning aims to train a valid distance metric which can enlarge the distances
between samples of different classes and reduce the distances between samples of the same
class [1]. Metric learning is closely related to k-Nearest Neighbor (k-NN) classification [2],
clustering [3], ranking [4], [5], feature extraction [6] and support vector machine (SVM) [7],
and has been widely applied to face recognition [8], person re-identification [9], [10], image
retrieval [11], [12], activity recognition [13], document classification [14], and link
prediction [15], etc. One popular metric learning approach is the Mahalanobis distance metric
learning, which is to learn a linear transformation matrix $L$ or a matrix $M = L^T L$ from the
training data. Given two samples $x_i$ and $x_j$, the Mahalanobis distance between them is
defined as:

$$d_M^2(x_i, x_j) = \|L(x_i - x_j)\|_2^2 = (x_i - x_j)^T M (x_i - x_j). \tag{1}$$
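The two forms of the distance in (1) can be checked numerically. The following is a minimal sketch, assuming NumPy and a randomly drawn transformation; the helper name `mahalanobis_sq` and the matrix sizes are illustrative choices, not part of the paper.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis distance (x_i - x_j)^T M (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ M @ diff)

rng = np.random.default_rng(0)
L = rng.standard_normal((3, 4))   # hypothetical linear map from R^4 to R^3
M = L.T @ L                       # M = L^T L is PSD by construction
x_i, x_j = rng.standard_normal(4), rng.standard_normal(4)

d_via_L = float(np.sum((L @ (x_i - x_j)) ** 2))   # ||L(x_i - x_j)||_2^2
d_via_M = mahalanobis_sq(x_i, x_j, M)             # (x_i - x_j)^T M (x_i - x_j)
```

Both values agree up to floating-point error, which is why a method may learn either $L$ or $M$.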

To satisfy the nonnegative property of a distance metric, $M$ should be positive semidefinite
(PSD). According to which one of $M$ and $L$ is learned, Mahalanobis distance metric learning
methods can be grouped into two categories. Methods that learn $L$, including neighborhood
components analysis (NCA) [16], large margin components analysis (LMCA) [17] and neighborhood
repulsed metric learning (NRML) [18], are mostly formulated as nonconvex optimization problems,
which are solved by gradient descent based optimizers. Taking the PSD constraint into account,
methods that learn $M$, including large margin nearest neighbor (LMNN) [19] and maximally
collapsing metric learning (MCML) [20], are mostly formulated as convex semidefinite
programming (SDP) problems, which can be optimized by standard SDP solvers [19], projected
gradient [3], Boosting-like [21], or Frank-Wolfe [22] algorithms. Davis et al. [23] proposed an
information-theoretic metric learning (ITML) model with an iterative Bregman projection
algorithm, which does not need projections onto the PSD cone. Besides, the use of online
solvers for metric learning has been discussed in [9], [24], [25].

On the other hand, kernel methods [26]-[31] have been widely studied in many learning tasks,
e.g., semi-supervised learning, multiple instance learning, multitask learning, etc. Kernel
learning methods, such as support vector machine (SVM), exhibit good generalization
performance. There are many open resources on kernel classification methods, and a variety of
toolboxes and libraries have been released [32]-[38]. It is thus important to investigate the
connections between metric learning and kernel classification and explore how to utilize the
kernel classification resources in the research and development of metric learning methods.

In this paper, we propose a novel formulation of metric learning by casting it as a kernel
classification problem, which allows us to effectively and efficiently learn distance metrics
by iterated training of SVM. Off-the-shelf SVM solvers such as LibSVM [33] can be employed to
solve the metric learning problem. Specifically, we propose two novel methods to bridge metric
learning with the well-developed SVM techniques, and they are easy to implement. First, we
propose a Positive-semidefinite Constrained Metric Learning (PCML) model, which can be solved
via iterating between PSD projection and dual SVM learning. Second, by re-parameterizing the
matrix $M$, we transform the PSD constraint into a nonnegative coefficient constraint and
consequently propose a Nonnegative-coefficient Constrained Metric Learning (NCML) model, which
can be solved by iterated learning of two SVMs. Both PCML and NCML have globally optimal
solutions, and our extensive experiments on UCI dataset classification, handwritten digit
recognition, face verification and person re-identification clearly demonstrate their
effectiveness.

TABLE 1: Summary of main abbreviations

Abbreviation    Full Name
PSD             Positive semidefinite (matrix)
SDP             Semidefinite programming
k-NN            k-nearest neighbor (classification)
KKT             Karush-Kuhn-Tucker (condition)
SVM             Support vector machine
LMCA [17]       Large margin components analysis
LMNN [2]        Large margin nearest neighbor
NCA [16]        Neighborhood components analysis
MCML [20]       Maximally collapsing metric learning
ITML [23]       Information-theoretic metric learning
LDML [8]        Logistic discriminant metric learning
DML-eig [22]    Distance metric learning with eigenvalue optimization
PLML [39]       Parametric local metric learning
KISSME [9]      Keep it simple and straightforward metric learning
PCML            Positive-semidefinite constrained metric learning
NCML            Nonnegative-coefficient constrained metric learning

The remainder of this paper is organized as follows. Section 2 reviews the related works.
Section 3 presents the PCML model and the optimization algorithm. Section 4 presents the model
and algorithm of NCML. Section 5 presents the experimental results, and Section 6 concludes the
paper.

The main abbreviations used in this paper are summarized in Table 1.

2 Related Work 

Compared with nonconvex metric learning models [16], [17], [40], the convex formulation of
metric learning [2], [3], [20]-[22] has drawn increasing attention due to its desired
properties such as global optimality. Most convex metric learning models can be formulated as
SDP or quadratic SDP problems. Standard SDP solvers, however, are inefficient for metric
learning, especially when the number of training samples is large or the feature dimension is
high. Therefore, a customized optimization algorithm needs to be developed for each specific
metric learning model. For LMNN, Weinberger et al. developed an efficient solver based on the
sub-gradient descent and the active set techniques [41]. In ITML, Davis et al. [23] suggested
an iterative Bregman projection algorithm. The iterative projected gradient descent method [3],
[42] has been widely employed for metric learning, but it requires an eigenvalue decomposition
in each iteration. Other algorithms such as block-coordinate descent [43], smooth optimization
[44], and Frank-Wolfe [22] have also been studied for metric learning. Unlike the customized
algorithms, in this work we formulate metric learning as a kernel classification problem and
solve it using off-the-shelf SVM solvers, which can guarantee the global optimality and the
PSD property of the learned $M$, and is easy to implement and efficient in training.

Another line of work aims to develop metric learning algorithms by solving the Lagrange dual
problems. Shen et al. derived the Lagrange dual of the exponential loss based metric learning
model, and proposed a boosting-like approach, namely BoostMetric, where the matrix $M$ is
learned as a linear positive combination of rank-one matrices [21], [45]. MetricBoost [46] and
FrobMetric [47], [48] were further proposed to improve the performance of BoostMetric. Liu and
Vemuri incorporated two regularization terms in the duality for robust metric learning [49].
Note that BoostMetric [21], [45], MetricBoost [46], and FrobMetric [47] are proposed for metric
learning with triplet constraints, whereas in many applications such as verification, only
pairwise constraints are available in the training stage.

Several SVM-based metric learning approaches [50]-[53] have also been proposed. Using SVM,
Nguyen and Guo [50] formulated metric learning as a quadratic semidefinite programming problem,
and suggested a projected gradient descent algorithm. The formulations of the proposed PCML and
NCML in this work are different from the model in [50], and they are solved via their dual
problems with off-the-shelf SVM solvers. Brunner et al. [51] proposed a pairwise SVM method to
learn a dissimilarity function rather than a distance metric. Different from [51], the proposed
PCML and NCML learn a distance metric and the matrix $M$ is constrained to be a PSD matrix. Do
et al. [52] studied SVM from a metric learning perspective and presented an improved variant of
SVM classification. Wang et al. [53] developed a kernel classification framework for metric
learning and proposed two learning models which can be efficiently implemented by the standard
SVM solvers. However, they adopted a two-step greedy strategy to solve the models and neglected
the PSD constraint in the first step. In this work, the proposed PCML and NCML models have
different formulations from [53], and their solutions are globally optimal.
3 Positive-Semidefinite Constrained Metric Learning (PCML)


Denote by $\{(x_i, y_i) \mid i = 1, 2, \cdots, N\}$ a training set, where $x_i \in \mathbb{R}^d$
is the $i$th training sample, and $y_i$ is the class label of $x_i$. The Mahalanobis distance
between $x_i$ and $x_j$ can be equivalently written as:

$$d_M^2(x_i, x_j) = \mathrm{tr}\left(M^T (x_i - x_j)(x_i - x_j)^T\right)
 = \left\langle M, (x_i - x_j)(x_i - x_j)^T \right\rangle, \tag{2}$$

where $M$ is a PSD matrix, $\langle A, B\rangle = \mathrm{tr}(A^T B)$ is defined as the
Frobenius inner product of two matrices $A$ and $B$, and $\mathrm{tr}(\cdot)$ stands for the
matrix trace operator. For each pair of $x_i$ and $x_j$, we define a matrix
$X_{ij} = (x_i - x_j)(x_i - x_j)^T$. With $X_{ij}$, the Mahalanobis distance can be rewritten as
$d_M^2(x_i, x_j) = \langle M, X_{ij}\rangle$.

3.1 PCML and Its Dual Problem 


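Let $\mathcal{S} = \{(x_i, x_j) : \text{the class labels of } x_i \text{ and } x_j \text{ are the same}\}$
be the set of similar pairs, and let
$\mathcal{D} = \{(x_i, x_j) : \text{the class labels of } x_i \text{ and } x_j \text{ are different}\}$
be the set of dissimilar pairs. By introducing an indicator variable $h_{ij}$

$$h_{ij} = \begin{cases} 1, & \text{if } (x_i, x_j) \in \mathcal{D} \\ -1, & \text{if } (x_i, x_j) \in \mathcal{S}, \end{cases} \tag{3}$$

the PCML model can be formulated as:

$$\begin{aligned}
\min_{M, b, \xi}\ & \frac{1}{2}\|M\|_F^2 + C \sum_{i,j} \xi_{ij} \\
\text{s.t.}\ & h_{ij}\left(\langle M, X_{ij}\rangle + b\right) \ge 1 - \xi_{ij},\ \ \xi_{ij} \ge 0,\ \forall i, j, \\
& M \succeq 0,
\end{aligned} \tag{4}$$

where $\xi_{ij}$ denotes the slack variables, $b$ denotes the bias, and $\|\cdot\|_F$ denotes the
Frobenius norm.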
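To make the roles of $h_{ij}$, $X_{ij}$ and the slack variables in (4) concrete, the sketch below evaluates the PCML primal objective for a fixed $M$ and $b$ on a toy data set. It assumes NumPy; the function names, the toy points and the choice $M = I$, $b = -1$ are illustrative and not taken from the paper. The slacks are set to the smallest values that satisfy the margin constraints.

```python
import numpy as np

def pair_matrix(x_i, x_j):
    """Rank-1 matrix X_ij = (x_i - x_j)(x_i - x_j)^T."""
    diff = x_i - x_j
    return np.outer(diff, diff)

def pcml_primal_objective(M, b, pairs, h, C):
    """0.5*||M||_F^2 + C*sum(xi), with xi the smallest feasible slacks."""
    margins = np.array([h_ij * (np.sum(M * pair_matrix(x_i, x_j)) + b)
                        for (x_i, x_j), h_ij in zip(pairs, h)])
    xi = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.sum(M * M) + C * np.sum(xi), xi

# Toy data: two samples of class 0 and one sample of class 1.
X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 0.0]])
y = [0, 0, 1]
idx = [(0, 1), (0, 2), (1, 2)]
pairs = [(X[i], X[j]) for i, j in idx]
h = [-1 if y[i] == y[j] else 1 for i, j in idx]   # -1: similar, +1: dissimilar

objective, xi = pcml_primal_objective(np.eye(2), b=-1.0, pairs=pairs, h=h, C=1.0)
```

With these values only the similar pair needs a small slack, so the objective is dominated by the regularization term.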

The PCML model defined above is convex and can be solved using standard SDP solvers. However,
the high complexity of general-purpose interior-point SDP solvers makes them suitable only for
small-scale problems.
In order to improve the efficiency, in the following we 
first analyze the Lagrange duality of the PCML model, 
and then propose an algorithm to iterate between SVM 
training and PSD projection to learn the Mahalanobis 
distance metric. 

By introducing the Lagrange multipliers $\lambda$ and a PSD matrix $Y$, the Lagrange dual of
the problem in (4) can be formulated as:

$$\begin{aligned}
\max_{\lambda, Y}\ & -\frac{1}{2}\Big\|\sum_{i,j} \lambda_{ij} h_{ij} X_{ij} + Y\Big\|_F^2 + \sum_{i,j} \lambda_{ij} \\
\text{s.t.}\ & \sum_{i,j} \lambda_{ij} h_{ij} = 0,\ \ 0 \le \lambda_{ij} \le C,\ \forall i, j,\ \ Y \succeq 0.
\end{aligned} \tag{5}$$

Please refer to Appendix A for the detailed derivation of the dual problem. Based on the
Karush-Kuhn-Tucker (KKT) conditions, the matrix $M$ can be obtained by

$$M = \sum_{i,j} \lambda_{ij} h_{ij} X_{ij} + Y. \tag{6}$$

The strong duality allows us to first solve the equivalent dual problem in (5) and then obtain
the matrix $M$ by (6). However, due to the PSD constraint $Y \succeq 0$, the problem in (5) is
still difficult to optimize.


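Algorithm 1 Algorithm of PCML

Input: $\mathcal{S} = \{(x_i, x_j)$ : the class labels of $x_i$ and $x_j$ are the same$\}$,
    $\mathcal{D} = \{(x_i, x_j)$ : the class labels of $x_i$ and $x_j$ are different$\}$, and $h_{ij}$.
Output: $M$.
Initialize $Y^{(0)}$, $t \leftarrow 0$.
repeat
    1. Update $\eta^{(t+1)}$ with $\eta_{ij}^{(t+1)} = 1 - h_{ij}\langle X_{ij}, Y^{(t)}\rangle$.
    2. Update $\lambda^{(t+1)}$ by solving the subproblem (7) using an SVM solver.
    3. Update $Y_0^{(t+1)} = -\sum_{i,j} \lambda_{ij}^{(t+1)} h_{ij} X_{ij}$.
    4. Update $Y^{(t+1)} = U^{(t+1)} \Lambda_+^{(t+1)} U^{(t+1)T}$, where
       $Y_0^{(t+1)} = U^{(t+1)} \Lambda^{(t+1)} U^{(t+1)T}$ and $\Lambda_+^{(t+1)} = \max(\Lambda^{(t+1)}, 0)$.
    5. $t \leftarrow t + 1$.
until convergence
$M = \sum_{i,j} \lambda_{ij}^{(t-1)} h_{ij} X_{ij} + Y^{(t-1)}$.
return $M$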


3.2 Alternative Optimization Algorithm

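To solve the dual problem efficiently, we propose an optimization approach by updating
$\lambda$ and $Y$ alternately. Given $Y$, we introduce a new variable $\eta$ with
$\eta_{ij} = 1 - h_{ij}\langle X_{ij}, Y\rangle = 1 - h_{ij}(x_i - x_j)^T Y (x_i - x_j)$, and the
subproblem on $\lambda$ can be formulated as:

$$\begin{aligned}
\max_{\lambda}\ & -\frac{1}{2}\sum_{i,j}\sum_{k,l} \lambda_{ij}\lambda_{kl} h_{ij} h_{kl} \langle X_{ij}, X_{kl}\rangle + \sum_{i,j} \eta_{ij}\lambda_{ij} \\
\text{s.t.}\ & \sum_{i,j} \lambda_{ij} h_{ij} = 0,\ \ 0 \le \lambda_{ij} \le C,\ \forall i, j.
\end{aligned} \tag{7}$$

The subproblem (7) is a QP problem. We can define a kernel function of sample pairs as follows:

$$K\left((x_i, x_j), (x_k, x_l)\right) = \langle X_{ij}, X_{kl}\rangle = \left((x_i - x_j)^T (x_k - x_l)\right)^2. \tag{8}$$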
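The identity in (8) means the Gram matrix over pairs can be built from difference vectors alone, without forming any $d \times d$ matrix $X_{ij}$. The following is a minimal sketch assuming NumPy; `pair_kernel` and the random pairs are illustrative names and data, not part of the paper or of any SVM library.

```python
import numpy as np

def pair_kernel(pairs_a, pairs_b):
    """K((x_i,x_j),(x_k,x_l)) = ((x_i - x_j)^T (x_k - x_l))^2 for all pair combinations."""
    D_a = np.array([x_i - x_j for x_i, x_j in pairs_a])
    D_b = np.array([x_k - x_l for x_k, x_l in pairs_b])
    return (D_a @ D_b.T) ** 2

rng = np.random.default_rng(1)
pairs = [(rng.standard_normal(5), rng.standard_normal(5)) for _ in range(4)]
K = pair_kernel(pairs, pairs)

# Reference: explicit Frobenius inner products <X_ij, X_kl>, each costing O(d^2).
X_mats = [np.outer(a - b, a - b) for a, b in pairs]
K_ref = np.array([[np.sum(A * B) for B in X_mats] for A in X_mats])
```

`K` matches `K_ref` and is symmetric PSD, so it is the kind of matrix a kernel SVM solver that accepts precomputed kernels would consume.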


Substituting (8) into (7), the subproblem on $\lambda$ becomes a kernel-based classification
problem, and can be efficiently solved by using the existing SVM solvers such as LibSVM [33].
Given $\lambda$, the subproblem on $Y$ can be formulated as the projection of a matrix onto the
convex cone of PSD matrices:

$$\min_{Y} \|Y - Y_0\|_F^2, \quad \text{s.t.}\ Y \succeq 0, \tag{9}$$

where $Y_0 = -\sum_{i,j} \lambda_{ij} h_{ij} X_{ij}$. Through the eigen-decomposition of $Y_0$,
i.e., $Y_0 = U \Lambda U^T$ where $\Lambda$ is the diagonal matrix of eigenvalues, the solution
to the subproblem on $Y$ can be explicitly expressed as $Y = U \Lambda_+ U^T$, where
$\Lambda_+ = \max(\Lambda, 0)$. Finally, the PCML algorithm is summarized in Algorithm 1.
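The projection step (9) and the recovery of $M$ by (6) can be sketched as follows. This assumes NumPy and uses made-up multipliers in place of the output of an SVM solver; `project_psd` and `update_Y_and_M` are hypothetical helper names.

```python
import numpy as np

def project_psd(Y0):
    """Closest PSD matrix to the symmetric matrix Y0 in Frobenius norm, as in (9)."""
    eigvals, U = np.linalg.eigh(Y0)
    return (U * np.maximum(eigvals, 0.0)) @ U.T   # U diag(max(Lambda, 0)) U^T

def update_Y_and_M(lam, h, X_mats):
    """Steps 3-4 of Algorithm 1, followed by eq. (6), for fixed multipliers lam."""
    S = sum(l * h_ij * X_ij for l, h_ij, X_ij in zip(lam, h, X_mats))
    Y = project_psd(-S)        # Y0 = -sum(lam * h * X)
    M = S + Y                  # eq. (6)
    return Y, M

rng = np.random.default_rng(2)
diffs = rng.standard_normal((6, 3))
X_mats = [np.outer(v, v) for v in diffs]
h = np.array([1, -1, 1, -1, 1, -1])
lam = rng.uniform(0.0, 1.0, size=6)   # made-up multipliers, not SVM output
Y, M = update_Y_and_M(lam, h, X_mats)
S = sum(l * h_ij * X_ij for l, h_ij, X_ij in zip(lam, h, X_mats))
```

Because $M = Y - Y_0$, the resulting $M$ is the positive part of $\sum_{i,j}\lambda_{ij}h_{ij}X_{ij}$ and is PSD whatever the signs of the individual terms.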


3.3 Optimality Condition 

As shown in [54], [55], the general alternating minimization approach will converge. By
alternately updating $\lambda$ and $Y$, the proposed algorithm can reach the global optimum of
the problems in (4) and (5).

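The optimality condition of the proposed algorithm can be checked by the duality gap in each
iteration, which is defined as the difference between the primal and dual objective values:

$$\mathrm{DualGap}_{\mathrm{PCML}}^{(n)} = \frac{1}{2}\big\|M^{(n)}\big\|_F^2 + C\sum_{i,j}\xi_{ij}^{(n)}
 + \frac{1}{2}\Big\|\sum_{i,j}\lambda_{ij}^{(n)} h_{ij} X_{ij} + Y^{(n)}\Big\|_F^2 - \sum_{i,j}\lambda_{ij}^{(n)}, \tag{10}$$

where $M^{(n)}$, $\lambda^{(n)}$, and $Y^{(n)}$ are feasible primal and dual variables, and
$\mathrm{DualGap}_{\mathrm{PCML}}^{(n)}$ is the duality gap in the $n$th iteration. According to
(6), we can derive that

$$M^{(n)} = \sum_{i,j}\lambda_{ij}^{(n)} h_{ij} X_{ij} + Y^{(n)} = Y^{(n)} - Y_0^{(n)}, \tag{11}$$

so the two Frobenius-norm terms in (10) coincide. As shown in Subsection 3.2,
$Y_0^{(n)} = U^{(n)}\Lambda^{(n)}U^{(n)T}$ and $Y^{(n)} = U^{(n)}\Lambda_+^{(n)}U^{(n)T}$, and hence
$M^{(n)} = U^{(n)}\Lambda_-^{(n)}U^{(n)T}$, where $\Lambda_-^{(n)} = \Lambda_+^{(n)} - \Lambda^{(n)}$.
Thus, $\|M^{(n)}\|_F^2$ can be computed by

$$\big\|M^{(n)}\big\|_F^2 = \mathrm{tr}\big(M^{(n)T}M^{(n)}\big)
 = \mathrm{tr}\big(U^{(n)}\Lambda_-^{(n)2}U^{(n)T}\big) = \mathrm{tr}\big(\Lambda_-^{(n)2}\big). \tag{12}$$

Substituting (11) and (12) into (10), the duality gap of PCML can be obtained as follows:

$$\mathrm{DualGap}_{\mathrm{PCML}}^{(n)} = C\sum_{i,j}\xi_{ij}^{(n)} - \sum_{i,j}\lambda_{ij}^{(n)} + \mathrm{tr}\big(\Lambda_-^{(n)2}\big). \tag{13}$$

Based on the KKT conditions of the PCML dual problem in (5), $\xi_{ij}^{(n)}$ can be obtained by

$$\xi_{ij}^{(n)} = \begin{cases} 0, & \forall\, \lambda_{ij}^{(n)} < C \\
\big[1 - h_{ij}\big(\langle M^{(n)}, X_{ij}\rangle + b^{(n)}\big)\big]_+, & \forall\, \lambda_{ij}^{(n)} = C, \end{cases} \tag{14}$$

where $[z]_+ = \max(z, 0)$ and

$$b^{(n)} = \frac{1}{h_{ij}} - \big\langle M^{(n)}, X_{ij}\big\rangle, \quad \forall\, 0 < \lambda_{ij}^{(n)} < C. \tag{15}$$

Please refer to Appendix A for the detailed derivation of $\xi_{ij}^{(n)}$ and $b^{(n)}$. The
duality gap is always nonnegative and approaches zero when the primal problem is convex. Thus,
it can be used as the termination condition of the algorithm. Fig. 1 plots the curve of duality
gap versus the number of iterations on the PenDigits dataset by PCML. One can see that the
duality gap converges to zero in less than 20 iterations and our algorithm will reach the
global optimum. In Algorithm 1, we adopt the following termination condition:

$$\mathrm{DualGap}_{\mathrm{PCML}}^{(t)} < \varepsilon \cdot \mathrm{DualGap}_{\mathrm{PCML}}^{(1)}, \tag{16}$$

where $\varepsilon$ is a small constant and we set $\varepsilon = 0.01$ in the experiment.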
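The simplification from (10) to (13) can be checked numerically: the gap computed from the eigenvalues of $Y_0$ alone should equal the gap computed from explicitly formed matrices. The sketch below assumes NumPy and uses made-up values for $\lambda$ and $\xi$ (so the gap value itself carries no meaning); the stopping rule assumes the reference gap is the one from the first iteration. All function names are illustrative.

```python
import numpy as np

def pcml_duality_gap(lam, h, X_mats, xi, C):
    """Duality gap in the form of (13): C*sum(xi) - sum(lam) + tr(Lambda_minus^2)."""
    Y0 = -sum(l * h_ij * X_ij for l, h_ij, X_ij in zip(lam, h, X_mats))
    eigvals = np.linalg.eigvalsh(Y0)
    lambda_minus = np.maximum(eigvals, 0.0) - eigvals   # Lambda_+ - Lambda
    return C * np.sum(xi) - np.sum(lam) + np.sum(lambda_minus ** 2)

def should_stop(gap_now, gap_first, eps=0.01):
    """Relative stopping rule in the spirit of (16)."""
    return gap_now < eps * gap_first

rng = np.random.default_rng(6)
diffs = rng.standard_normal((6, 3))
X_mats = [np.outer(v, v) for v in diffs]
h = np.array([1, -1, 1, -1, 1, -1])
lam = rng.uniform(0.0, 1.0, size=6)   # made-up multipliers
xi = rng.uniform(0.0, 0.5, size=6)    # made-up slacks
C = 1.0
gap = pcml_duality_gap(lam, h, X_mats, xi, C)

# Direct evaluation of (10) with explicitly formed Y and M.
S = sum(l * h_ij * X_ij for l, h_ij, X_ij in zip(lam, h, X_mats))
eigvals, U = np.linalg.eigh(-S)
Y = (U * np.maximum(eigvals, 0.0)) @ U.T
M = S + Y
gap_direct = 0.5 * np.sum(M * M) + C * xi.sum() + 0.5 * np.sum((S + Y) ** 2) - lam.sum()
```

Only one eigenvalue decomposition per iteration is needed, and it is the same one already required by the PSD projection.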



Fig. 1: Duality gap vs. number of iterations on the PenDigits 
dataset for PCML. 


3.4 Remarks 

Warm-start: In the updating of A, we adopt a simple 
warm-start strategy. We use the solution of the previous 
iteration as the initialization of the next iteration. Since 
the previous solution can serve as a good guess, warm- 
start results in significant improvement in efficiency. 

Construction of pairwise constraints: Based on the training set, we can introduce on the order
of $N^2$ pairwise constraints in total. However, in practice we only need to choose a subset of
pairwise constraints to reduce the computational cost. For each sample, we find its $k$ nearest
neighbors to construct similar pairs and its $k$ farthest neighbors to construct dissimilar
pairs. Thus, we only need $2kN$ pairwise constraints. By this strategy, we can reduce the scale
of pairwise constraints from $O(N^2)$ to $O(kN)$. Since $k$ is usually a small constant (1 to 3)
in practice, the computational cost of metric learning is much reduced. Similar strategies for
constructing pairwise or triplet constraints can be found in [2], [11].
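The pair-selection strategy described above can be sketched as follows. The sketch assumes NumPy, Euclidean distance for the neighbor search, and the reading that similar pairs come from the $k$ nearest samples with the same label and dissimilar pairs from the $k$ farthest samples with a different label; `build_pairs` and the toy data are hypothetical.

```python
import numpy as np

def build_pairs(X, y, k=1):
    """Return (i, j, h_ij) triples: k nearest same-class and k farthest
    different-class samples for every i (h = -1 similar, +1 dissimilar)."""
    y = np.asarray(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    pairs = []
    for i in range(len(X)):
        same = np.flatnonzero((y == y[i]) & (np.arange(len(X)) != i))
        diff = np.flatnonzero(y != y[i])
        for j in same[np.argsort(dist[i, same])[:k]]:
            pairs.append((i, int(j), -1))
        for j in diff[np.argsort(-dist[i, diff])[:k]]:
            pairs.append((i, int(j), +1))
    return pairs

# Toy 1-D data: two well-separated classes with three samples each.
X = np.array([[0.0], [1.0], [3.0], [10.0], [11.0], [13.0]])
y = [0, 0, 0, 1, 1, 1]
pairs = build_pairs(X, y, k=1)
```

With $N = 6$ and $k = 1$ this yields $2kN = 12$ constraints instead of all 15 unordered pairs; the saving grows quadratically with $N$.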

Computational Complexity: We use the LibSVM library for SVM training. The computational
complexity of SMO-type algorithms [34] is $O(k^2 N^2 d)$. For PSD projection, the complexity of
conventional SVD algorithms is $O(d^3)$.

4 Nonnegative-coefficient Constrained Metric Learning (NCML)

Given a set of rank-1 PSD matrices $M_t = m_t m_t^T$ $(t = 1, \cdots, T)$, a linear combination
of $M_t$ is defined as $M = \sum_t \alpha_t M_t$, where $\alpha_t$ is the scalar combination
coefficient. One can easily prove the following Theorem 1.

Theorem 1: If the scalar coefficient $\alpha_t \ge 0$, $\forall t$, the matrix
$M = \sum_t \alpha_t M_t$ is a PSD matrix, where $M_t = m_t m_t^T$ is a rank-1 PSD matrix.

Proof: Denote by $u \in \mathbb{R}^d$ an arbitrary vector. Based on the expression of $M$, we
have:

$$u^T M u = u^T \Big(\sum_t \alpha_t m_t m_t^T\Big) u = \sum_t \alpha_t u^T m_t m_t^T u = \sum_t \alpha_t (u^T m_t)^2.$$

Since $(u^T m_t)^2 \ge 0$ and $\alpha_t \ge 0$, $\forall t$, we have $u^T M u \ge 0$.
Therefore, $M$ is a PSD matrix. □
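Theorem 1 is easy to confirm numerically. The following sketch, assuming NumPy and randomly drawn vectors $m_t$ and coefficients $\alpha_t$ (all illustrative), checks both the closed form of the quadratic form used in the proof and the sign of the smallest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 5, 4
m = rng.standard_normal((T, d))           # rows are the vectors m_t
alpha = rng.uniform(0.0, 2.0, size=T)     # nonnegative coefficients alpha_t
M = sum(a * np.outer(v, v) for a, v in zip(alpha, m))

u = rng.standard_normal(d)
quad_form = float(u @ M @ u)                          # u^T M u
closed_form = float(np.sum(alpha * (m @ u) ** 2))     # sum_t alpha_t (u^T m_t)^2
min_eig = float(np.linalg.eigvalsh(M).min())
```

The converse does not hold in general: a PSD matrix may also arise from a combination with some negative coefficients, so the nonnegativity constraint is sufficient but restricts the search space.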


4.1 NCML and Its Dual Problem 

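Motivated by Theorem 1, we propose to transform the PSD constraint in (4) by re-parameterizing
the distance metric $M$, and develop a nonnegative-coefficient constrained metric learning
(NCML) method to learn the PSD matrix $M$. Given the similar-pair set $\mathcal{S}$ and the
dissimilar-pair set $\mathcal{D}$, a rank-1 PSD matrix $X_{ij}$ can be constructed for each pair
$(x_i, x_j)$. By assuming that the learned matrix should be the linear combination of $X_{ij}$
with the nonnegative coefficient constraint, the NCML model can be formulated as:

$$\begin{aligned}
\min_{M, b, \alpha, \xi}\ & \frac{1}{2}\|M\|_F^2 + C\sum_{i,j}\xi_{ij} \\
\text{s.t.}\ & h_{ij}\left(\langle M, X_{ij}\rangle + b\right) \ge 1 - \xi_{ij},\ \ \alpha_{ij} \ge 0,\ \ \xi_{ij} \ge 0,\ \forall i, j, \\
& M = \sum_{i,j}\alpha_{ij}X_{ij}.
\end{aligned} \tag{17}$$

By substituting $M$ with $\sum_{i,j}\alpha_{ij}X_{ij}$, we reformulate the NCML model as follows:

$$\begin{aligned}
\min_{\alpha, b, \xi}\ & \frac{1}{2}\sum_{i,j}\sum_{k,l}\alpha_{ij}\alpha_{kl}\langle X_{ij}, X_{kl}\rangle + C\sum_{i,j}\xi_{ij} \\
\text{s.t.}\ & h_{ij}\Big(\sum_{k,l}\alpha_{kl}\langle X_{ij}, X_{kl}\rangle + b\Big) \ge 1 - \xi_{ij},\ \ \alpha_{ij} \ge 0,\ \ \xi_{ij} \ge 0,\ \forall i, j.
\end{aligned} \tag{18}$$

By introducing the Lagrange multipliers $\eta$ and $\beta$, the Lagrange dual of the primal
problem in (18) can be formulated as:

$$\begin{aligned}
\max_{\eta, \beta}\ & -\frac{1}{2}\sum_{i,j}\sum_{k,l}(\beta_{ij}h_{ij} + \eta_{ij})(\beta_{kl}h_{kl} + \eta_{kl})\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\beta_{ij} \\
\text{s.t.}\ & \sum_{k,l}\eta_{kl}\langle X_{ij}, X_{kl}\rangle \ge 0,\ \ 0 \le \beta_{ij} \le C,\ \forall i, j, \\
& \sum_{i,j}\beta_{ij}h_{ij} = 0.
\end{aligned} \tag{19}$$

Please refer to Appendix B for the detailed derivation of the dual problem. Based on the KKT
conditions, the coefficient $\alpha_{ij}$ can be obtained by:

$$\alpha_{ij} = \beta_{ij}h_{ij} + \eta_{ij}. \tag{20}$$

Thus, we can first solve the above dual problem, and then obtain the matrix $M$ by

$$M = \sum_{i,j}(\beta_{ij}h_{ij} + \eta_{ij})X_{ij}. \tag{21}$$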


4.2 Optimization Algorithm 

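There are two groups of variables, $\eta$ and $\beta$, in problem (19). We adopt an alternative
optimization approach to solve them. First, given $\eta$, the variables $\beta_{ij}$ can be
solved as follows:

$$\begin{aligned}
\max_{\beta}\ & -\frac{1}{2}\sum_{i,j}\sum_{k,l}\beta_{ij}\beta_{kl}h_{ij}h_{kl}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\delta_{ij}\beta_{ij} \\
\text{s.t.}\ & 0 \le \beta_{ij} \le C,\ \forall i, j,\ \ \sum_{i,j}\beta_{ij}h_{ij} = 0,
\end{aligned} \tag{22}$$

where $\delta$ is the variable with
$\delta_{ij} = 1 - h_{ij}\sum_{k,l}\eta_{kl}\langle X_{ij}, X_{kl}\rangle$. Clearly, the subproblem
on $\beta$ is exactly the dual problem of SVM, and it can be efficiently solved by any standard
SVM solver, e.g., LibSVM [33].

Algorithm 2 Algorithm of NCML

Input: Training set $\{(x_i, x_j), h_{ij}\}$.
Output: The matrix $M$.
Initialize $\eta^{(0)}$ with small random values, $t \leftarrow 0$.
repeat
    1. Update $\delta^{(t+1)}$ with $\delta_{ij}^{(t+1)} = 1 - h_{ij}\sum_{k,l}\eta_{kl}^{(t)}\langle X_{ij}, X_{kl}\rangle$.
    2. Update $\beta^{(t+1)}$ by solving the subproblem (22) using an SVM solver.
    3. Update $\gamma^{(t+1)}$ with $\gamma_{ij}^{(t+1)} = \sum_{k,l}\beta_{kl}^{(t+1)}h_{kl}\langle X_{ij}, X_{kl}\rangle$.
    4. Update $\mu^{(t+1)}$ by solving the subproblem (25) using an SVM solver.
    5. Update $\eta^{(t+1)}$ with $\eta_{ij}^{(t+1)} = \mu_{ij}^{(t+1)} - h_{ij}\beta_{ij}^{(t+1)}$.
    6. $t \leftarrow t + 1$.
until convergence
return $M$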

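Given $\beta$, the subproblem on $\eta$ can be formulated as follows:

$$\begin{aligned}
\min_{\eta}\ & \frac{1}{2}\sum_{i,j}\sum_{k,l}\eta_{ij}\eta_{kl}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\eta_{ij}\gamma_{ij} \\
\text{s.t.}\ & \sum_{k,l}\eta_{kl}\langle X_{ij}, X_{kl}\rangle \ge 0,\ \forall i, j,
\end{aligned} \tag{23}$$

where $\gamma_{ij} = \sum_{k,l}\beta_{kl}h_{kl}\langle X_{ij}, X_{kl}\rangle$. To simplify the
subproblem on $\eta$, we derive the Lagrange dual of (23) based on the KKT condition:

$$\eta_{ij} = \mu_{ij} - h_{ij}\beta_{ij}, \tag{24}$$

where $\mu$ is the Lagrange dual multiplier. The Lagrange dual problem of (23) is formulated as
follows:

$$\begin{aligned}
\max_{\mu}\ & -\frac{1}{2}\sum_{i,j}\sum_{k,l}\mu_{ij}\mu_{kl}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\gamma_{ij}\mu_{ij} - \frac{1}{2}\sum_{i,j}\beta_{ij}h_{ij}\gamma_{ij} \\
\text{s.t.}\ & \mu_{ij} \ge 0,\ \forall i, j,
\end{aligned} \tag{25}$$

where the last term of the objective does not depend on $\mu$. Please refer to Appendix C for
the detailed derivation. Clearly, problem (25) is a simpler QP problem than (23), which can be
efficiently solved by the standard SVM solvers.

By alternately updating $\mu$ and $\beta$, we can solve the NCML dual problem (19). After
obtaining the optimal solutions of $\mu$ and $\beta$, the optimal solution of $\alpha$ in
problem (18) can be obtained by

$$\alpha_{ij} = \mu_{ij},\ \forall i, j. \tag{26}$$

We then have $M = \sum_{i,j}\alpha_{ij}X_{ij}$. The NCML algorithm is summarized in Algorithm 2.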
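The $\mu$-update is a QP with only nonnegativity constraints, so its structure can be illustrated without an SVM package. The sketch below swaps in plain projected gradient ascent for the SVM solver used in the paper, drops the term of (25) that is constant in $\mu$, and uses made-up values of $\beta$; it assumes NumPy, and `solve_mu` and `objective` are hypothetical names. It also checks that $\alpha = \beta h + \eta$ collapses to $\mu$, as stated in (26).

```python
import numpy as np

def solve_mu(K, gamma, n_iter=500):
    """Projected gradient ascent for max -0.5*mu^T K mu + gamma^T mu, s.t. mu >= 0."""
    step = 1.0 / max(np.linalg.eigvalsh(K).max(), 1e-12)   # 1 / Lipschitz constant
    mu = np.zeros_like(gamma)
    for _ in range(n_iter):
        mu = np.maximum(0.0, mu + step * (gamma - K @ mu))
    return mu

def objective(K, gamma, mu):
    return float(-0.5 * mu @ K @ mu + gamma @ mu)

rng = np.random.default_rng(4)
diffs = rng.standard_normal((8, 3))       # difference vectors x_i - x_j of 8 pairs
K = (diffs @ diffs.T) ** 2                # <X_ij, X_kl> via the pair kernel
h = np.array([1, -1] * 4)
beta = rng.uniform(0.0, 1.0, size=8)      # made-up beta, not SVM output
gamma = K @ (beta * h)                    # gamma_ij = sum_kl beta_kl h_kl <X_ij, X_kl>

mu = solve_mu(K, gamma)
eta = mu - h * beta                       # eq. (24)
alpha = beta * h + eta                    # eq. (20); equals mu, eq. (26)
M = sum(a * np.outer(v, v) for a, v in zip(alpha, diffs))
```

Since every $\alpha_{ij} = \mu_{ij}$ is nonnegative, the assembled $M$ is PSD by Theorem 1 without any eigenvalue decomposition.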
Analogous to PCML, the updating of $\beta$ and $\mu$ in NCML can be sped up by using the
warm-start strategy. As shown in Fig. 2, the proposed NCML algorithm will converge in 10 to 15
iterations.


4.3 Optimality Condition 

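We check the duality gap of NCML to investigate its optimality condition. From the primal and
dual objectives in (18) and (19), the NCML duality gap in the $n$th iteration is

$$\begin{aligned}
\mathrm{DualGap}_{\mathrm{NCML}}^{(n)} ={}& \frac{1}{2}\sum_{i,j,k,l}\alpha_{ij}^{(n)}\alpha_{kl}^{(n)}\langle X_{ij}, X_{kl}\rangle + C\sum_{i,j}\xi_{ij}^{(n)} \\
& + \frac{1}{2}\sum_{i,j,k,l}\big(\beta_{ij}^{(n)}h_{ij} + \eta_{ij}^{(n)}\big)\big(\beta_{kl}^{(n)}h_{kl} + \eta_{kl}^{(n)}\big)\langle X_{ij}, X_{kl}\rangle - \sum_{i,j}\beta_{ij}^{(n)},
\end{aligned} \tag{27}$$

where $\alpha^{(n)}$ and $\xi^{(n)}$ are the feasible solutions to the primal problem,
$\beta^{(n)}$ and $\eta^{(n)}$ are the feasible solutions to the dual problem, and
$\mathrm{DualGap}_{\mathrm{NCML}}^{(n)}$ is the duality gap in the $n$th iteration. As
$\eta^{(n)}$ and $\mu^{(n)}$ are the optimal solutions to the primal subproblem on $\eta$ in (23)
and its dual problem in (25), respectively, the duality gap of the subproblem on $\eta$ is zero,
i.e.,

$$\begin{aligned}
& \frac{1}{2}\sum_{i,j}\sum_{k,l}\eta_{ij}^{(n)}\eta_{kl}^{(n)}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\eta_{ij}^{(n)}\gamma_{ij}^{(n)} + \frac{1}{2}\sum_{i,j}\beta_{ij}^{(n)}h_{ij}\gamma_{ij}^{(n)} \\
& \quad + \frac{1}{2}\sum_{i,j}\sum_{k,l}\mu_{ij}^{(n)}\mu_{kl}^{(n)}\langle X_{ij}, X_{kl}\rangle - \sum_{i,j}\gamma_{ij}^{(n)}\mu_{ij}^{(n)} = 0.
\end{aligned} \tag{28}$$

As shown in (26), $\alpha_{ij}^{(n)}$ and $\mu_{ij}^{(n)}$ should be equal. We substitute (28)
into (27) as follows:

$$\mathrm{DualGap}_{\mathrm{NCML}}^{(n)} = \sum_{i,j}\big(\alpha_{ij}^{(n)}\gamma_{ij}^{(n)} - \beta_{ij}^{(n)}\big) + C\sum_{i,j}\xi_{ij}^{(n)}. \tag{29}$$

Based on the KKT conditions of the NCML dual problem in (19), $\xi_{ij}^{(n)}$ can be obtained
by (30) (see page 7), where $[z]_+ = \max(z, 0)$ and $b^{(n)}$ can be obtained by

$$b^{(n)} = \frac{1}{h_{ij}} - \sum_{k,l}\alpha_{kl}^{(n)}\langle X_{ij}, X_{kl}\rangle \quad \text{for all } 0 < \beta_{ij}^{(n)} < C. \tag{31}$$

Please refer to Appendix B for the detailed derivation of $\xi_{ij}^{(n)}$ and $b^{(n)}$.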

Fig. 2 plots the curve of duality gap versus the number of iterations on the PenDigits dataset
by NCML. One can see that the duality gap converges to zero in 15 iterations, and NCML reaches
the global optimum. In the implementation of Algorithm 2, we adopt the following termination
condition:

$$\mathrm{DualGap}_{\mathrm{NCML}}^{(t)} < \varepsilon \cdot \mathrm{DualGap}_{\mathrm{NCML}}^{(1)}, \tag{32}$$

where $\varepsilon$ is a small constant and we set $\varepsilon = 0.01$ in the experiment.

Fig. 2: Duality gap vs. number of iterations on the PenDigits dataset for NCML.


4.4 Remarks

Computational complexity: We use the same strategy as that in PCML to construct the pairwise
constraints for NCML. In each iteration, NCML calls the SVM solver twice while PCML calls it
only once. When the SMO-type algorithm [34] is adopted for SVM training, the computational
complexity of NCML is $O(k^2 N^2 d)$. One extra advantage of NCML lies in its lower
computational cost with respect to $d$, which involves the computation of
$\langle X_{ij}, X_{kl}\rangle$ and the construction of matrix $M$. Since
$\langle X_{ij}, X_{kl}\rangle = \left((x_i - x_j)^T(x_k - x_l)\right)^2$, the cost of computing
$\langle X_{ij}, X_{kl}\rangle$ is $O(d)$. The cost of constructing the matrix $M$ is less than
$O(kNd^2)$, and this operation is required only once after the convergence of $\beta$ and $\mu$.

Nonlinear extensions: Note that $\langle X_{ij}, X_{kl}\rangle = \mathrm{tr}(X_{ij}^T X_{kl})$ can
be treated as an inner product of two pairs of samples: $(x_i, x_j)$ and $(x_k, x_l)$. Analogous
to PCML, if we can define a kernel $K((x_i, x_j), (x_k, x_l))$ on $(x_i, x_j)$ and $(x_k, x_l)$,
we can substitute $\langle X_{ij}, X_{kl}\rangle$ with $K((x_i, x_j), (x_k, x_l))$ to develop new
linear or even nonlinear metric learning algorithms, and the Mahalanobis distance between any
two samples $x_m$ and $x_n$ can be formulated as:

$$(x_m - x_n)^T M (x_m - x_n) = \sum_{i,j}\alpha_{ij} K\left((x_i, x_j), (x_m, x_n)\right). \tag{33}$$

Another nonlinear extension strategy is to define a kernel $k(x_i, x_j)$ on $x_i$ and $x_j$.
Since $\langle X_{ij}, X_{kl}\rangle = \left(x_i^T x_k - x_i^T x_l - x_j^T x_k + x_j^T x_l\right)^2$,
we can substitute $\langle X_{ij}, X_{kl}\rangle$ with
$\left(k(x_i, x_k) - k(x_i, x_l) - k(x_j, x_k) + k(x_j, x_l)\right)^2$, and the Mahalanobis
distance between $x_m$ and $x_n$ can be formulated as:

$$(x_m - x_n)^T M (x_m - x_n) = \sum_{i,j}\alpha_{ij}\left(k(x_i, x_m) - k(x_i, x_n) - k(x_j, x_m) + k(x_j, x_n)\right)^2. \tag{34}$$


That is to say, NCML allows us to learn nonlinear metrics for histograms and structural data by
designing proper kernel functions and incorporating appropriate regularizations on $\alpha$.
Metric learning for structural data beyond vector data has recently been receiving considerable
research interest [5], [56], and NCML can provide a new perspective on this topic.
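The second nonlinear extension only needs a sample-level kernel $k$. The sketch below, assuming NumPy and made-up nonnegative coefficients $\alpha_{ij}$, evaluates the kernelized distance and checks that the linear kernel $k(u, v) = u^T v$ recovers the ordinary Mahalanobis distance with $M = \sum_{i,j}\alpha_{ij}X_{ij}$. The RBF kernel and its bandwidth are an illustrative choice, not one prescribed by the paper.

```python
import numpy as np

def kernel_distance(alpha, pairs, x_m, x_n, k):
    """sum_ij alpha_ij * (k(x_i,x_m) - k(x_i,x_n) - k(x_j,x_m) + k(x_j,x_n))^2."""
    total = 0.0
    for a, (x_i, x_j) in zip(alpha, pairs):
        total += a * (k(x_i, x_m) - k(x_i, x_n) - k(x_j, x_m) + k(x_j, x_n)) ** 2
    return total

def linear_kernel(u, v):
    return float(u @ v)

def rbf_kernel(u, v, bandwidth=0.5):   # hypothetical kernel choice
    return float(np.exp(-bandwidth * np.sum((u - v) ** 2)))

rng = np.random.default_rng(5)
pairs = [(rng.standard_normal(3), rng.standard_normal(3)) for _ in range(4)]
alpha = rng.uniform(0.0, 1.0, size=4)   # made-up nonnegative coefficients
x_m, x_n = rng.standard_normal(3), rng.standard_normal(3)

M = sum(a * np.outer(p - q, p - q) for a, (p, q) in zip(alpha, pairs))
d_explicit = float((x_m - x_n) @ M @ (x_m - x_n))
d_linear = kernel_distance(alpha, pairs, x_m, x_n, linear_kernel)
d_rbf = kernel_distance(alpha, pairs, x_m, x_n, rbf_kernel)
```

With a nonlinear kernel the matrix $M$ is never formed explicitly; the distance is evaluated through the training pairs with nonzero $\alpha_{ij}$.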

SVM solvers: Although our implementation is based on LibSVM, there are a number of well-studied
SVM training algorithms, e.g., core vector machines [35], LaRank [36], BMRM [37], and Pegasos
[38], which can be utilized for large scale metric learning. Moreover, we can draw on the
progress in kernel methods [26]-[28] to develop semi-supervised, multiple instance, and
multitask metric learning approaches.

5 Experimental Results 

We evaluate the proposed PCML and NCML models for k-NN classification (k = 1) using 9 UCI
datasets, 4 handwritten digit datasets, 2 face verification datasets and 2 person
re-identification datasets. We compare PCML and NCML with the baseline Euclidean distance
metric and 7 state-of-the-art metric learning models, including NCA [16], ITML [23], MCML [20],
LDML [8], LMNN [2], PLML [39], and DML-eig [22]. On each dataset, if the partition of training
set and test set is not defined, we evaluate the performance of each method by 10-fold
cross-validation, and the classification error rate and training time are obtained by averaging
over 10 runs of 10-fold cross-validation. PCML and NCML are implemented using the LibSVM¹
toolbox. The source codes of NCA², ITML³, MCML⁴, LDML⁵, LMNN⁶, PLML⁷, and DML-eig⁸ are
available online, and we tune their parameters to get the best results.

5.1 Results on the UCI Datasets 

We first use 9 datasets from the UCI Machine Learning 
Repository [57] to evaluate the proposed models. The 
information of the 9 UCI datasets is summarized in Table 
2. On the Satellite, SPECTF Heart, and Letter datasets, 
the training set and test set are defined. On the other 
datasets, we use 10-fold cross-validation to evaluate the 
metric learning models. 

The proposed PCML and NCML methods involve only one hyper-parameter, i.e., the regularization
parameter C. We simply adopt the cross-validation strategy to select C by investigating the
influence of C on the classification error rate. Fig. 3 shows the curves of classification
error rate versus C for PCML and NCML on the SPECTF Heart dataset. The curves on other datasets
are similar. We can observe that when C < 1, the classification error rates of PCML and NCML
will be low and stable. When C is higher than 1, the classification error rates jump
dramatically. Thus, we set C < 1 in our experiments.

1. http://www.csie.ntu.edu.tw/~cjlin/libsvm/
2. http://www.cs.berkeley.edu/~fowlkes/software/nca/
3. http://www.cs.utexas.edu/~pjam/itml/
4. http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
5. http://lear.inrialpes.fr/people/guillaumin/code.php
6. http://www.cse.wustl.edu/~kilian/code/code.html
7. http://cui.unige.ch/~wangjun/
8. http://empslocal.ex.ac.uk/people/staff/yy267/software.html

TABLE 2: The UCI datasets used in our experiments.

Dataset            # of training samples   # of test samples   Feature dimension   # of classes
Breast Tissue                         96                  10                   9              6
Cardiotocography                   1,914                 212                  21             10
ILPD                                 525                  58                  10              2
Letter                            16,000               4,000                  16             26
Parkinsons                           176                  19                  22              2
Satellite                          4,435               2,000                  36              6
Segmentation                       2,079                 231                  19              7
Sonar                                188                  20                  60              2
SPECTF Heart                          80                 187                  44              2




Fig. 3: Classification error rate (%) versus C. (a) PCML; (b) NCML.

We compare the classification error rates of the competing methods in Table 3. On the
Cardiotocography and Segmentation datasets, PCML achieves the lowest error rates. On the
Segmentation and SPECTF Heart datasets, NCML achieves the lowest error rates. The average ranks
of competing methods are listed in the last row of Table 3. On each dataset, we rank the
methods based on their error rates, i.e., we assign rank 1 to the method with the lowest error
rate and rank 2 to the method with the second lowest error rate, and so on. The average rank is
defined as the mean rank of one method over the nine datasets, which can provide a fair
comparison of the learning methods [58]. From Table 3, we can see that PCML and NCML achieve
the best and second best average ranks, respectively, demonstrating strong classification
capability for general classification tasks.
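The average-rank computation described above can be reproduced from the error rates of Table 3. The sketch below is plain Python; the tie rule (tied methods share the lowest rank) is an assumption, since the text does not state how ties are handled, and ties among the rounded values can make an individual entry differ from the printed one (LMNN in this sketch).

```python
methods = ["Euclidean", "NCA", "ITML", "MCML", "LDML",
           "LMNN", "PLML", "DML-eig", "PCML", "NCML"]

# Classification error rates (%) copied from Table 3, one row per dataset.
errors = {
    "Breast Tissue":    [31.00, 41.27, 35.82, 32.09, 48.00, 34.37, 34.13, 33.13, 38.00, 35.37],
    "Cardiotocography": [21.40, 21.16, 18.67, 22.29, 22.26, 19.21, 18.54, 29.31, 18.50, 18.69],
    "ILPD":             [35.69, 34.65, 35.35, 35.49, 35.84, 34.12, 31.61, 36.87, 33.96, 32.43],
    "Letter":           [4.33, 2.47, 3.80, 4.20, 11.05, 3.45, 3.28, 3.85, 2.67, 2.72],
    "Parkinsons":       [4.08, 6.63, 6.13, 9.84, 7.15, 5.26, 8.84, 7.82, 5.68, 7.26],
    "Satellite":        [10.95, 10.40, 11.45, 15.65, 15.90, 10.05, 11.85, 10.90, 11.15, 11.10],
    "Segmentation":     [2.86, 2.51, 2.73, 2.60, 2.86, 2.64, 2.68, 2.97, 2.12, 2.12],
    "Sonar":            [12.98, 15.40, 12.07, 24.29, 22.86, 11.57, 12.07, 15.07, 12.71, 13.29],
    "SPECTF Heart":     [38.50, 26.74, 34.76, 38.50, 33.16, 34.76, 27.27, 31.02, 28.88, 25.67],
}

def ranks(row):
    """Rank 1 = lowest error; tied values share the lowest rank (assumed tie rule)."""
    return [1 + sum(other < value for other in row) for value in row]

avg_rank = {
    name: round(sum(ranks(row)[col] for row in errors.values()) / len(errors), 2)
    for col, name in enumerate(methods)
}
```

Under this rule `avg_rank["PCML"]` and `avg_rank["NCML"]` come out as 3.56 and 3.89, matching the last row of Table 3.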

TABLE 3: Classification error rate (%) on the UCI datasets.

Dataset            Euclidean    NCA   ITML   MCML   LDML   LMNN   PLML  DML-eig   PCML   NCML
Breast Tissue          31.00  41.27  35.82  32.09  48.00  34.37  34.13    33.13  38.00  35.37
Cardiotocography       21.40  21.16  18.67  22.29  22.26  19.21  18.54    29.31  18.50  18.69
ILPD                   35.69  34.65  35.35  35.49  35.84  34.12  31.61    36.87  33.96  32.43
Letter                  4.33   2.47   3.80   4.20  11.05   3.45   3.28     3.85   2.67   2.72
Parkinsons              4.08   6.63   6.13   9.84   7.15   5.26   8.84     7.82   5.68   7.26
Satellite              10.95  10.40  11.45  15.65  15.90  10.05  11.85    10.90  11.15  11.10
Segmentation            2.86   2.51   2.73   2.60   2.86   2.64   2.68     2.97   2.12   2.12
Sonar                  12.98  15.40  12.07  24.29  22.86  11.57  12.07    15.07  12.71  13.29
SPECTF Heart           38.50  26.74  34.76  38.50  33.16  34.76  27.27    31.02  28.88  25.67
Average Rank            5.78   4.56   5.44   7.56   8.44   4.00   4.33     7.00   3.56   3.89

We then compare the training time of competing metric learning methods in Fig. 4. All the
experiments are run on a PC with 4 Intel Core i5-2410 CPUs (2.30 GHz) and 16 GB RAM. Clearly,
the proposed PCML and NCML are the fastest in most cases. Although DML-eig is faster than PCML
on the Letter dataset, its classification error rate on this dataset is much higher than PCML
and NCML. On average, PCML and NCML are 23 and 18 times faster than PLML, the third fastest
algorithm, respectively.



Fig. 4: Training time (s) of NCA, ITML, MCML, LDML, LMNN, DML-eig, PLML, PCML and NCML. From 1
to 9, the Dataset ID represents Breast Tissue, Cardiotocography, ILPD, Letter, Parkinsons,
Satellite, Segmentation, Sonar and SPECTF Heart.


5.2 Handwritten Digit Recognition 

We further evaluate the proposed methods on four handwritten
digit datasets: MNIST, Pen-based recognition of
handwritten Digits data set (PenDigits), Semeion and USPS.
Table 4 summarizes the basic information of these four
handwritten digit datasets. On the MNIST, PenDigits,
and USPS datasets, we use the defined training sets
to train the metrics, and use the defined test sets to
compute the classification error rates. On the Semeion
dataset, we use 10-fold cross-validation to evaluate the
metric learning methods, and the classification error rate
and training time are obtained by averaging over 10 runs
of 10-fold cross-validation.
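
The repeated cross-validation protocol used on Semeion can be sketched as follows. This is a minimal illustration rather than the authors' code: `train_and_test` is a hypothetical callback that trains a metric on the training indices and returns the error rate on the held-out fold, and the shuffling scheme and the seed are assumptions.

```python
import random

def kfold_indices(n_samples, n_folds, rng):
    """Shuffle the sample indices and split them into n_folds nearly equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[f::n_folds] for f in range(n_folds)]

def repeated_cv_error(n_samples, train_and_test, n_folds=10, n_runs=10, seed=0):
    """Average the held-out error over n_runs independent n_folds-fold splits."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_runs):
        for held_out in kfold_indices(n_samples, n_folds, rng):
            held_out_set = set(held_out)
            train_idx = [i for i in range(n_samples) if i not in held_out_set]
            # train_and_test is a hypothetical user-supplied callback.
            errors.append(train_and_test(train_idx, held_out))
    return sum(errors) / len(errors)
```

The same averaging can be applied to the training time by letting the callback return a time measurement instead of an error rate.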

TABLE 4: The handwritten digit datasets used in the experiments.

| Dataset | # of training samples | # of test samples | Dimension | PCA dimension | # of classes |
|---|---|---|---|---|---|
| MNIST | 60,000 | 10,000 | 784 | 100 | 10 |
| PenDigits | 7,494 | 3,498 | 16 | N/A | 10 |
| Semeion | 1,434 | 159 | 256 | 100 | 10 |
| USPS | 7,291 | 2,007 | 256 | 100 | 10 |

As the dimensions of images in the MNIST, Semeion
and USPS datasets are relatively high, we use principal
component analysis (PCA) to reduce the feature
dimension to 100, and train the metrics in the PCA
subspace. Table 5 lists the classification error rates of
the ten competing methods on the four handwritten
digit datasets. The last row of Table 5 lists the average
ranks of the competing methods. We do not report the
error rate and training time of MCML on the MNIST
dataset because MCML requires too much memory
(more than 30 GB) on this dataset and cannot run on
our PC. From Table 5, we can see that PCML and NCML
tie for the best average rank (2.75). Again, the
results indicate that the proposed methods have better
classification performance.

All the experiments were executed on the same PC as
used in Subsection 5.1. Fig. 5 compares the training time
of NCA, ITML, MCML, LDML, LMNN, DML-eig, PLML,
PCML, and NCML. Clearly, the proposed PCML and
NCML methods are much faster than the other methods.
On average, PCML and NCML are 61 and 27 times faster,
respectively, than PLML, the third fastest algorithm. One
can conclude that PCML and NCML offer promising
solutions to effective and efficient metric learning.





Fig. 5: Training time (s) of NCA, ITML, MCML, LDML, LMNN, 
DML-eig, PLML, PCML and NCML. From 1 to 4, the Dataset ID 
represents MNIST, PenDigits, Semeion and USPS. 

Finally, we compare the running time of PCML and
NCML under different feature dimensions d. As analyzed
in Subsections 3.4 and 4.4, the time complexities
of PCML and NCML are $O(N^2 d + d^3)$ and $O(N^2 d)$,
respectively. Fig. 6 shows the training time on the Semeion
dataset with different PCA dimensions. We can see that
when the dimension is lower than 110, the training time
of NCML is longer than that of PCML. When the dimension is
higher than 110, the training time of PCML increases and
becomes longer than that of NCML.

TABLE 5: Comparison of classification error rate (%) on the handwritten digit datasets.

| Dataset | Euclidean | NCA | ITML | MCML | LDML | LMNN | DML-eig | PLML | PCML | NCML |
|---|---|---|---|---|---|---|---|---|---|---|
| MNIST | 2.87 | 5.46 | 2.89 | N/A | 6.05 | 2.28 | 5.06 | 2.54 | 3.85 | 2.80 |
| PenDigits | 2.26 | 2.23 | 2.29 | 2.26 | 6.20 | 2.52 | 3.75 | 2.46 | 2.06 | 2.06 |
| Semeion | 8.54 | 8.60 | 5.71 | 11.23 | 11.98 | 6.09 | 5.72 | 7.66 | 4.83 | 5.53 |
| USPS | 5.08 | 5.68 | 6.33 | 5.08 | 8.77 | 5.38 | 11.36 | 6.73 | 5.33 | 5.43 |
| Average Rank | 4.00 | 6.25 | 5.25 | 4.67 | 9.50 | 4.50 | 7.50 | 5.75 | 2.75 | 2.75 |
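
The average ranks in the last row of Table 5 can be recomputed from the error rates. The sketch below is illustrative only and rests on two assumptions that the text does not state: tied methods share the best rank of the tie, and a missing result (MCML on MNIST) is skipped when averaging. Under these assumptions the computed values agree with the reported row after rounding to two decimals.

```python
def average_ranks(error_table):
    """Average per-dataset rank of each method (rank 1 = lowest error).

    error_table holds one row per dataset; None marks a missing result.
    Ties share the best rank of the tie (assumed convention), and missing
    results are left out of the average.
    """
    n_methods = len(error_table[0])
    rank_sums = [0.0] * n_methods
    counts = [0] * n_methods
    for row in error_table:
        valid = [e for e in row if e is not None]
        for m, err in enumerate(row):
            if err is None:
                continue
            # Rank = 1 + number of methods with a strictly lower error.
            rank_sums[m] += 1 + sum(1 for other in valid if other < err)
            counts[m] += 1
    return [s / c for s, c in zip(rank_sums, counts)]

methods = ["Euclidean", "NCA", "ITML", "MCML", "LDML",
           "LMNN", "DML-eig", "PLML", "PCML", "NCML"]
table5 = [  # error rates (%) copied from Table 5: MNIST, PenDigits, Semeion, USPS
    [2.87, 5.46, 2.89, None, 6.05, 2.28, 5.06, 2.54, 3.85, 2.80],
    [2.26, 2.23, 2.29, 2.26, 6.20, 2.52, 3.75, 2.46, 2.06, 2.06],
    [8.54, 8.60, 5.71, 11.23, 11.98, 6.09, 5.72, 7.66, 4.83, 5.53],
    [5.08, 5.68, 6.33, 5.08, 8.77, 5.38, 11.36, 6.73, 5.33, 5.43],
]
for name, rank in zip(methods, average_ranks(table5)):
    print(f"{name:10s} {rank:.2f}")
```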




Fig. 6: Training time (s) vs. PCA dimension on the Semeion 
dataset. 


5.3 Face Verification 

In this subsection, we evaluate the proposed methods for 
face verification using two challenging face databases: 
Labeled Faces in the Wild (LFW) [59] and Public Figures 
(PubFig) [60]. 

5.3.1 The LFW Database 

The face images in the LFW database were collected from 
the Internet and demonstrate large variations of pose, 
illumination, expression, etc. The database consists of 
13,233 face images from 5,749 persons. Under the image 
restricted setting, the performance of a face verification 
method is evaluated by 10-fold cross validation. For each 
of the 10 runs, the database provides 300 positive pairs 
and 300 negative pairs for testing, and 5,400 image pairs 
for training. The verification rate and Receiver Operator 
Characteristic (ROC) curve of each method are obtained 
by averaging over the 10 runs. 

TABLE 6: Verification accuracies (%) and training time (s)
of competing metric learning methods on the LFW-funneled
dataset under the image restricted setting.

| Method | Accuracy (%): SIFT | Accuracy (%): Attribute | Accuracy (%): SIFT + Attribute | Training time (s): SIFT | Training time (s): Attribute |
|---|---|---|---|---|---|
| PCML | 85.70 | 84.70 | 89.00 | 13.22 | 14.17 |
| NCML | 86.45 | 85.45 | 89.50 | 31.62 | 27.55 |
| DML-eig [22] | 81.27 | 80.13 | 85.65 | 1931.50 | 113.79 |
| ITML [9] | 82.40 | 82.98 | 85.50 | 3341.80 | 3222.40 |
| LDML [8] | 79.27 | 83.40 | 86.02 | 1316.60 | 543.08 |
| KISSME [9] | 80.50 | 84.60 | 85.39 | 0.22 | 0.05 |
| Euclidean | 68.10 | 75.25 | 76.53 | 0 | 0 |

In our experiments, we use the SIFT [61] features and
the attribute features provided by [8] and [60] to evaluate
the metric learning methods. Since the dimension of SIFT
features is high (i.e., 128 × 3 × 9), PCA is used to
reduce the feature dimension to 150. Under the restricted
setting of the LFW database, we only know whether
two images are matched or not for the given pairs. In
the training stage, we use the training pairs to train
a Mahalanobis distance metric. In the test stage, we
compare the Mahalanobis distance of the test pair with
a threshold t to decide whether the two images are
matched or not.
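
The test-stage decision rule can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: feature vectors are plain Python lists, the learned matrix `M` is a list of rows, the squared distance is thresholded (thresholding the unsquared distance is equivalent up to the choice of t), and all function names are hypothetical.

```python
def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    return sum(diff[r] * M[r][c] * diff[c] for r in range(n) for c in range(n))

def is_match(x, y, M, t):
    """Declare a pair matched when its distance under M does not exceed t."""
    return mahalanobis_sq(x, y, M) <= t

def verification_accuracy(pairs, labels, M, t):
    """Fraction of pairs whose thresholded decision agrees with the label.

    labels[i] is True for a matched pair and False otherwise.
    """
    correct = sum(is_match(x, y, M, t) == label
                  for (x, y), label in zip(pairs, labels))
    return correct / len(pairs)
```

Sweeping t over the observed distances and recording the true and false positive rates at each value gives the points of an ROC curve.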

We report the ROC curves of PCML, NCML, DML-eig
[22], ITML [23], KISSME [9], LDML [8] and Euclidean
distance in Fig. 7. We also compare the verification
accuracies of PCML, NCML and the other metric learning
methods using the SIFT and the attribute features
in Table 6. It can be seen that the proposed PCML and
NCML methods perform much better than all the other
competing methods. Using the combination of SIFT and
Attribute features, the verification accuracies of PCML
(89.00%) and NCML (89.50%) are higher than that of the
third best method, i.e., DML-eig (85.65%), by 3.35% and
3.85%, respectively. We also compare the training time of
the competing methods in Table 6. The training time of
PCML and NCML is shorter than that of the other methods
except for KISSME, because KISSME is a one-pass
training approach. Although KISSME is faster, its
verification accuracy is much lower than those of PCML
and NCML.

5.3.2 The PubFig Database 

The PubFig database [60] contains 58,797 face images of
200 persons with large variations in pose, lighting,
expression, scene, camera, imaging conditions and
parameters, etc. In this database, the face verification methods
are also evaluated using 10-fold cross validation. Among
the given 20,000 image pairs, we randomly select 18,000
pairs for training and use the remaining 2,000 pairs for
testing in each run. The ROC curves and verification
rates are obtained by averaging over the 10 runs.

Fig. 7: The ROC curves of different metric learning methods on the LFW-funneled dataset under the image restricted setting [8],
[9], [22]. (a) SIFT feature; (b) Attribute feature; (c) SIFT + Attribute feature.


We use the attribute features provided by [60] to
evaluate the competing methods. Fig. 8 shows the ROC
curves of PCML, NCML, KISSME [9], ITML [23], DML-eig
[22], Attribute Classifiers [60] and the baseline
Euclidean distance. It can be seen that the performance of
PCML and NCML is similar, and is superior to that of
the other methods.

We further report the verification rates of PCML,
NCML and the other methods in Table 7. One can
see that PCML and NCML perform better than the
other methods. The accuracies of PCML (79.71%) and
NCML (79.75%) are higher than that of the third best
method, i.e., Attribute Classifiers (78.65%), by 1.06% and
1.10%, respectively. The training time of PCML, NCML and
the other metric learning methods is also listed in Table 7.
It can be seen that PCML and NCML are much faster than
ITML and DML-eig.


Fig. 8: The ROC curves of different methods on the PubFig
database (the curves of PCML and NCML almost coincide).


5.4 Person Re-identification 

TABLE 7: Verification accuracies (%) and training time (s) of
competing methods on the PubFig database.

| Methods | Verification Accuracy (%) | Training Time (s) |
|---|---|---|
| PCML | 79.71 | 118.55 |
| NCML | 79.75 | 216.38 |
| KISSME [9] | 77.60 | 0.09 |
| ITML [9] | 69.30 | 3796.50 |
| Attribute Classifiers [60] | 78.65 | - |
| DML-eig [22] | 77.36 | 1132.30 |
| Euclidean | 72.50 | 0 |

In this subsection, we evaluate the performance of the
proposed methods for person re-identification, i.e.,
recognizing a person at different locations and at different
times [62]. Two challenging person re-identification
databases, the Viewpoint Invariant Pedestrian Recognition
(VIPeR) database [63] and the Context Aware
Vision using Image-based Active Recognition for
Re-Identification (CAVIAR4REID) database [64], are used to
assess the performance of the proposed methods.

5.4.1 The VIPeR Database 

The VIPeR database contains 1,264 pedestrian images of
632 persons from two camera viewpoints (camera A
and camera B). For each person, there are two images
taken from different viewpoints with a change of 90
degrees. In our experiments, we randomly select 316
persons and use their images for training, and use the
images of the other 316 persons for testing. For the
testing images, we use the images taken by camera B
as the probe set and the images from camera A as the
gallery set. Finally, 10 partitions of training and test sets
are constructed, and the average accuracy over the 10
test sets is computed as the final accuracy.

We report the Cumulative Matching Characteristic
(CMC) curves of the competing methods in Fig. 9. We
also compare their accuracies under different ranks in
Table 8. From Fig. 9 and Table 8, one can see that
both PCML and NCML outperform LMNN, ITML and
Euclidean distance significantly under all ranks. When
the rank is no more than 25, PCML performs similarly
to KISSME, while NCML outperforms KISSME. When
the rank is between 25 and 200, both PCML and NCML
perform better than KISSME. The training time of the
metric learning methods is also reported in Table 8. We
can see that both PCML and NCML are much more
efficient than LMNN and ITML in training.

TABLE 8: Person re-identification accuracies (%) and training
time (s) on the VIPeR dataset.

| Methods | Rank 1 | Rank 25 | Rank 50 | Rank 80 | Rank 100 | Training Time (s) |
|---|---|---|---|---|---|---|
| PCML | 19.40 | 80.60 | 93.77 | 97.25 | 98.23 | 4.94 |
| NCML | 21.04 | 82.28 | 93.07 | 97.25 | 98.32 | 9.05 |
| KISSME [9] | 19.60 | 80.70 | 91.80 | 96.68 | 97.78 | 0.07 |
| LMNN [9] | 16.61 | 72.94 | 88.13 | 94.30 | 96.36 | 437.43 |
| ITML [9] | 15.66 | 74.21 | 88.29 | 95.41 | 96.99 | 1199.10 |
| DML-eig [22] | 8.07 | 50.47 | 65.82 | 77.69 | 82.44 | 47.03 |
| Euclidean | 10.90 | 44.94 | 60.76 | 70.09 | 74.37 | 0 |

TABLE 9: Person re-identification accuracies (%) and training
time (s) on the CAVIAR4REID dataset.

| Methods | Rank 1 | Rank 5 | Rank 10 | Rank 15 | Training Time (s) |
|---|---|---|---|---|---|
| PCML | 32.86 | 61.26 | 76.06 | 85.34 | 11.47 |
| NCML | 32.27 | 60.38 | 75.33 | 84.25 | 19.23 |
| DML-eig [22] | 30.68 | 57.15 | 73.18 | 82.64 | 829.24 |
| LMNN [9] | 28.66 | 56.53 | 71.30 | 81.19 | 95.62 |
| ITML [9] | 31.48 | 59.56 | 74.83 | 84.15 | 2819.18 |
| KISSME [9] | 29.87 | 54.75 | 71.36 | 82.15 | 1.12 |
| Euclidean | 27.98 | 50.67 | 66.25 | 77.54 | 0 |



Fig. 9: The CMC curves on the VIPeR dataset.
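
A CMC curve reports, for each rank k, the fraction of probe images whose true match appears among the k nearest gallery images. The sketch below shows one way to compute it from a probe-by-gallery distance matrix. It is an illustration only: the function name and argument layout are hypothetical, distances are assumed to be precomputed with the learned metric, and a probe with no matching gallery identity is simply never counted as a hit.

```python
def cmc_curve(dist, probe_ids, gallery_ids, max_rank):
    """Return [CMC(1), ..., CMC(max_rank)] as fractions in [0, 1].

    dist[p][g] is the distance between probe p and gallery image g.
    """
    hits = [0] * max_rank
    for p, row in enumerate(dist):
        # Gallery indices sorted from nearest to farthest.
        order = sorted(range(len(row)), key=lambda g: row[g])
        # 0-based position of the first gallery image with the probe's identity.
        first = next((r for r, g in enumerate(order)
                      if gallery_ids[g] == probe_ids[p]), None)
        if first is None:
            continue
        # A match at position `first` counts as a hit for every rank > first.
        for k in range(first, max_rank):
            hits[k] += 1
    return [h / len(dist) for h in hits]
```

The rank-k accuracies listed in a re-identification table are the values of this curve at the selected k, expressed in percent.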

5.4.2 The CAVIAR4REID Database 
CAVIAR4REID consists of 1,220 pedestrian images from
72 persons, where the images are extracted from the
shopping center scenario of the CAVIAR database [64].
The database covers a large range of image resolution
and pose variation. The minimum and maximum image
sizes in the CAVIAR4REID database are 17 × 39 and
72 × 144, respectively. Following [65] and [10], we use
the hierarchical Gaussian (HG) features to evaluate the
metric learning methods.

According to the evaluation protocol in [10], we randomly
select 36 persons and use their images for training,
and use the remaining images for testing. For the testing
images, we randomly select one image for each person to
construct a probe set consisting of 36 images, and use the
other test images as the gallery set. Finally, 10 partitions
of training and test sets are constructed, and the final
results are obtained by averaging over the 10 runs.

We report the CMC curves of PCML, NCML, DML-eig
[22], KISSME [9], ITML [23], LMNN [19] and Euclidean
distance in Fig. 10. One can see that PCML and NCML
perform the best and the second best among all the
competing methods, respectively. Table 9 lists the
re-identification accuracies and training time of the different
methods. PCML and NCML perform better than the
other metric learning methods under all the ranks, and
they are much faster than the other metric learning
methods except for KISSME.




Fig. 10: The CMC curves on the CAVIAR4REID dataset. 

6 Conclusion 

We proposed two distance metric learning models,
namely Positive-semidefinite Constrained Metric Learning
(PCML) and Nonnegative-coefficient Constrained
Metric Learning (NCML). The proposed models can
guarantee the positive semidefinite property of the
learned matrix M, and can be solved efficiently by the
existing SVM solvers. Experimental results on nine UCI
machine learning repository datasets and four handwritten
digit datasets showed that, compared with the
state-of-the-art metric learning methods, including NCA
[16], ITML [23], MCML [20], LDML [8], LMNN [2],
PLML [39], and DML-eig [22], the proposed PCML and
NCML methods not only achieve higher classification
accuracy, but also are much faster in training. On
average, they are 35 and 21 times faster, respectively,
than PLML, the third fastest metric learning method.
The experimental results on the LFW, PubFig, VIPeR and
CAVIAR4REID databases indicate that the proposed
methods also perform very well in vision tasks such as
face verification and person re-identification, leading to
higher verification rates and very competitive training
efficiency.

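Appendix A
The Dual of PCML

The original problem of PCML is formulated as

$$
\begin{aligned}
\min_{M, b, \xi} \quad & \frac{1}{2}\|M\|_F^2 + C \sum_{i,j} \xi_{ij} \\
\text{s.t.} \quad & h_{ij}\left(\langle M, X_{ij}\rangle + b\right) \ge 1 - \xi_{ij}, \;\; \xi_{ij} \ge 0, \;\; \forall i,j, \\
& M \succeq 0.
\end{aligned}
\tag{35}
$$

Its Lagrangian is:

$$
\begin{aligned}
L(\lambda, \kappa, Y, M, b, \xi) = {} & \frac{1}{2}\|M\|_F^2 + C \sum_{i,j} \xi_{ij} \\
& - \sum_{i,j} \lambda_{ij}\left[h_{ij}\left(\langle M, X_{ij}\rangle + b\right) - 1 + \xi_{ij}\right] \\
& - \sum_{i,j} \kappa_{ij}\xi_{ij} - \langle Y, M\rangle,
\end{aligned}
\tag{36}
$$

where $\lambda$, $\kappa$ and $Y$ are the Lagrange multipliers which
satisfy $\lambda_{ij} \ge 0$, $\kappa_{ij} \ge 0$, $\forall i,j$, and $Y \succeq 0$.
Converting the original problem to its dual problem needs the following
KKT conditions:

$$
\frac{\partial L(\lambda, \kappa, Y, M, b, \xi)}{\partial M} = 0 \;\Rightarrow\; M - \sum_{i,j} \lambda_{ij} h_{ij} X_{ij} - Y = 0, \tag{37}
$$

$$
\frac{\partial L(\lambda, \kappa, Y, M, b, \xi)}{\partial b} = 0 \;\Rightarrow\; \sum_{i,j} \lambda_{ij} h_{ij} = 0, \tag{38}
$$

$$
\frac{\partial L(\lambda, \kappa, Y, M, b, \xi)}{\partial \xi_{ij}} = 0 \;\Rightarrow\; C - \lambda_{ij} - \kappa_{ij} = 0 \;\Rightarrow\; 0 \le \lambda_{ij} \le C, \tag{39}
$$

$$
h_{ij}\left(\langle M, X_{ij}\rangle + b\right) - 1 + \xi_{ij} \ge 0, \quad \xi_{ij} \ge 0, \tag{40}
$$

$$
\lambda_{ij} \ge 0, \quad \kappa_{ij} \ge 0, \quad Y \succeq 0, \tag{41}
$$

$$
\lambda_{ij}\left[h_{ij}\left(\langle M, X_{ij}\rangle + b\right) - 1 + \xi_{ij}\right] = 0, \quad \kappa_{ij}\xi_{ij} = 0. \tag{42}
$$

Equation (37) implies the following relationship between
$\lambda$, $Y$ and $M$:

$$
M = \sum_{i,j} \lambda_{ij} h_{ij} X_{ij} + Y. \tag{43}
$$

Substituting (37)~(39) back into the Lagrangian, we get
the following Lagrange dual problem of PCML:

$$
\begin{aligned}
\max_{\lambda, Y} \quad & -\frac{1}{2}\Big\|\sum_{i,j} \lambda_{ij} h_{ij} X_{ij} + Y\Big\|_F^2 + \sum_{i,j} \lambda_{ij} \\
\text{s.t.} \quad & \sum_{i,j} \lambda_{ij} h_{ij} = 0, \;\; 0 \le \lambda_{ij} \le C, \;\; \forall i,j, \;\; Y \succeq 0.
\end{aligned}
\tag{44}
$$

As we can see from (43) and (44), M is explicitly
determined by the training procedure, but b is not.
Nevertheless, b can be easily found by using the KKT
complementarity conditions in (39) and (42), which show
that $\xi_{ij} = 0$ if $\lambda_{ij} < C$, and
$h_{ij}\left(\langle M, X_{ij}\rangle + b\right) - 1 + \xi_{ij} = 0$
if $\lambda_{ij} > 0$. Thus we can simply take any training point
for which $0 < \lambda_{ij} < C$ to compute b by

$$
b = \frac{1}{h_{ij}} - \langle M, X_{ij}\rangle, \quad \text{for all } 0 < \lambda_{ij} < C. \tag{45}
$$

Note that it is numerically wiser to take the average over
all such training data points to compute b. After b is
computed, we can compute $\xi_{ij}$ by

$$
\xi_{ij} =
\begin{cases}
0, & \text{for all } \lambda_{ij} < C, \\
\left[1 - h_{ij}\left(\langle M, X_{ij}\rangle + b\right)\right]_+, & \text{for all } \lambda_{ij} = C,
\end{cases}
\tag{46}
$$

where the term $[z]_+ = \max(z, 0)$ denotes the standard
hinge loss.

Appendix B
The Dual of NCML

The original problem of NCML is as follows:

$$
\begin{aligned}
\min_{\alpha, b, \xi} \quad & \frac{1}{2}\sum_{i,j}\sum_{k,l} \alpha_{ij}\alpha_{kl}\langle X_{ij}, X_{kl}\rangle + C \sum_{i,j} \xi_{ij} \\
\text{s.t.} \quad & h_{ij}\Big(\sum_{k,l} \alpha_{kl}\langle X_{ij}, X_{kl}\rangle + b\Big) \ge 1 - \xi_{ij}, \\
& \xi_{ij} \ge 0, \;\; \alpha_{ij} \ge 0, \;\; \forall i,j.
\end{aligned}
\tag{47}
$$

Its Lagrangian can be defined as:

$$
\begin{aligned}
L(\beta, \sigma, \nu, \alpha, b, \xi) = {} & \frac{1}{2}\sum_{i,j}\sum_{k,l} \alpha_{ij}\alpha_{kl}\langle X_{ij}, X_{kl}\rangle + C \sum_{i,j} \xi_{ij} \\
& - \sum_{i,j} \beta_{ij}\Big[h_{ij}\Big(\sum_{k,l} \alpha_{kl}\langle X_{ij}, X_{kl}\rangle + b\Big) - 1 + \xi_{ij}\Big] \\
& - \sum_{i,j} \sigma_{ij}\alpha_{ij} - \sum_{i,j} \nu_{ij}\xi_{ij},
\end{aligned}
\tag{48}
$$

where $\beta$, $\sigma$ and $\nu$ are the Lagrange multipliers which
satisfy $\beta_{ij} \ge 0$, $\sigma_{ij} \ge 0$ and $\nu_{ij} \ge 0$, $\forall i,j$.
Converting the original problem to its dual problem needs the following
KKT conditions:

$$
\frac{\partial L(\beta, \sigma, \nu, \alpha, b, \xi)}{\partial \alpha_{ij}} = 0 \;\Rightarrow\;
\sum_{k,l} \alpha_{kl}\langle X_{ij}, X_{kl}\rangle - \sum_{k,l} \beta_{kl} h_{kl}\langle X_{ij}, X_{kl}\rangle - \sigma_{ij} = 0, \tag{49}
$$

$$
\frac{\partial L(\beta, \sigma, \nu, \alpha, b, \xi)}{\partial b} = 0 \;\Rightarrow\; \sum_{i,j} \beta_{ij} h_{ij} = 0, \tag{50}
$$

$$
\frac{\partial L(\beta, \sigma, \nu, \alpha, b, \xi)}{\partial \xi_{ij}} = 0 \;\Rightarrow\; C - \beta_{ij} - \nu_{ij} = 0 \;\Rightarrow\; 0 \le \beta_{ij} \le C, \tag{51}
$$

$$
h_{ij}\Big(\sum_{k,l} \alpha_{kl}\langle X_{ij}, X_{kl}\rangle + b\Big) - 1 + \xi_{ij} \ge 0, \quad \xi_{ij} \ge 0, \quad \alpha_{ij} \ge 0, \tag{52}
$$

$$
\beta_{ij} \ge 0, \quad \sigma_{ij} \ge 0, \quad \nu_{ij} \ge 0, \quad \forall i,j, \tag{53}
$$

$$
\beta_{ij}\Big[h_{ij}\Big(\sum_{k,l} \alpha_{kl}\langle X_{ij}, X_{kl}\rangle + b\Big) - 1 + \xi_{ij}\Big] = 0, \quad \nu_{ij}\xi_{ij} = 0, \quad \sigma_{ij}\alpha_{ij} = 0, \quad \forall i,j. \tag{54}
$$

Here we introduce a coefficient vector $\eta$, which satisfies
$\sigma_{ij} = \sum_{k,l} \eta_{kl}\langle X_{ij}, X_{kl}\rangle$. Note that
$\langle X_{ij}, X_{kl}\rangle$ is a positive definite kernel. So we can
guarantee that every $\eta$ corresponds to a unique $\sigma$, and vice versa.
Equation (49) implies the following relationship between $\alpha$, $\beta$ and $\eta$:

$$
\alpha_{ij} = \beta_{ij} h_{ij} + \eta_{ij}. \tag{55}
$$

Substituting (49)~(51) back into the Lagrangian, we get
the Lagrange dual problem of NCML as follows:

$$
\begin{aligned}
\max_{\eta, \beta} \quad & -\frac{1}{2}\sum_{i,j}\sum_{k,l} \left(\beta_{ij} h_{ij} + \eta_{ij}\right)\left(\beta_{kl} h_{kl} + \eta_{kl}\right)\langle X_{ij}, X_{kl}\rangle + \sum_{i,j} \beta_{ij} \\
\text{s.t.} \quad & \sum_{k,l} \eta_{kl}\langle X_{ij}, X_{kl}\rangle \ge 0, \;\; 0 \le \beta_{ij} \le C, \;\; \forall i,j, \\
& \sum_{i,j} \beta_{ij} h_{ij} = 0.
\end{aligned}
\tag{56}
$$

Analogous to PCML, we can use the KKT complementarity
condition in (54) to compute b and $\xi_{ij}$ in NCML.
Equations (51) and (54) show that $\xi_{ij} = 0$ if $\beta_{ij} < C$, and
$h_{ij}\big(\sum_{k,l} \alpha_{kl}\langle X_{ij}, X_{kl}\rangle + b\big) - 1 + \xi_{ij} = 0$
if $\beta_{ij} > 0$. Thus we can simply take any training data point
for which $0 < \beta_{ij} < C$ to compute b by

$$
b = \frac{1}{h_{ij}} - \sum_{k,l} \alpha_{kl}\langle X_{ij}, X_{kl}\rangle, \quad \text{for all } 0 < \beta_{ij} < C. \tag{57}
$$

After obtaining b, we can compute $\xi_{ij}$ by

$$
\xi_{ij} =
\begin{cases}
0, & \forall \beta_{ij} < C, \\
\Big[1 - h_{ij}\Big(\sum_{k,l} \alpha_{kl}\langle X_{ij}, X_{kl}\rangle + b\Big)\Big]_+, & \forall \beta_{ij} = C,
\end{cases}
\tag{58}
$$

where the term $[z]_+ = \max(z, 0)$ denotes the standard
hinge loss.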

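Appendix C
The Dual of the Subproblem on $\eta$ in NCML

The subproblem on $\eta$ is formulated as follows:

$$
\begin{aligned}
\min_{\eta} \quad & \frac{1}{2}\sum_{i,j}\sum_{k,l} \eta_{ij}\eta_{kl}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j} \eta_{ij}\gamma_{ij} \\
\text{s.t.} \quad & \sum_{k,l} \eta_{kl}\langle X_{ij}, X_{kl}\rangle \ge 0, \;\; \forall i,j,
\end{aligned}
\tag{59}
$$

where $\gamma_{ij} = \sum_{k,l} \beta_{kl} h_{kl}\langle X_{ij}, X_{kl}\rangle$. Its Lagrangian is:

$$
\begin{aligned}
L(\mu, \eta) = {} & \frac{1}{2}\sum_{i,j}\sum_{k,l} \eta_{ij}\eta_{kl}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j} \eta_{ij}\gamma_{ij} \\
& - \sum_{i,j} \mu_{ij}\sum_{k,l} \eta_{kl}\langle X_{ij}, X_{kl}\rangle,
\end{aligned}
\tag{60}
$$

where $\mu$ is the Lagrange multiplier which satisfies $\mu_{ij} \ge 0$,
$\forall i,j$. Converting the original problem to its dual problem
needs the following KKT condition:

$$
\frac{\partial L(\mu, \eta)}{\partial \eta_{ij}} = 0 \;\Rightarrow\;
\sum_{k,l} \eta_{kl}\langle X_{ij}, X_{kl}\rangle + \gamma_{ij} - \sum_{k,l} \mu_{kl}\langle X_{ij}, X_{kl}\rangle = 0. \tag{61}
$$

Equation (61) implies the following relationship between
$\mu$, $\eta$ and $\beta$:

$$
\eta_{ij} = \mu_{ij} - h_{ij}\beta_{ij}. \tag{62}
$$

Substituting (61) and (62) back into the Lagrangian,
we get the following Lagrange dual problem of the
subproblem on $\eta$:

$$
\begin{aligned}
\max_{\mu} \quad & -\frac{1}{2}\sum_{i,j}\sum_{k,l} \mu_{ij}\mu_{kl}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j} \gamma_{ij}\mu_{ij} \\
& - \frac{1}{2}\sum_{i,j}\sum_{k,l} \beta_{ij}\beta_{kl} h_{ij} h_{kl}\langle X_{ij}, X_{kl}\rangle \\
\text{s.t.} \quad & \mu_{ij} \ge 0, \;\; \forall i,j.
\end{aligned}
\tag{63}
$$

Since $\beta$ is fixed in this subproblem,
$\sum_{i,j}\sum_{k,l} \beta_{ij}\beta_{kl} h_{ij} h_{kl}\langle X_{ij}, X_{kl}\rangle$
remains constant in (63). Thus we can omit this term and have the following
simplified Lagrange dual problem:

$$
\begin{aligned}
\max_{\mu} \quad & -\frac{1}{2}\sum_{i,j}\sum_{k,l} \mu_{ij}\mu_{kl}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j} \gamma_{ij}\mu_{ij} \\
\text{s.t.} \quad & \mu_{ij} \ge 0, \;\; \forall i,j.
\end{aligned}
\tag{64}
$$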
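
The simplified dual (64) is a quadratic program with only nonnegativity constraints, so a small instance can be solved with a very simple method. The sketch below uses plain cyclic coordinate ascent, which is swapped in here purely for illustration and is not the SVM-solver-based procedure the paper relies on. The kernel matrix `K` (with entries equal to the pairwise kernel values), the single-integer pair indexing, the function name and the sweep count are all assumptions, and every diagonal entry of `K` is assumed to be positive.

```python
def solve_eta_subproblem(K, beta, h, n_sweeps=500):
    """Coordinate ascent for max_mu -1/2 mu^T K mu + gamma^T mu, s.t. mu >= 0.

    K is the kernel matrix over pairs (list of rows), beta the fixed dual
    variables and h the +1/-1 labels. Returns (mu, eta), eta = mu - beta*h.
    """
    n = len(K)
    bh = [beta[p] * h[p] for p in range(n)]
    # gamma_p = sum_q beta_q h_q K[p][q].
    gamma = [sum(K[p][q] * bh[q] for q in range(n)) for p in range(n)]
    mu = [0.0] * n
    for _ in range(n_sweeps):
        for p in range(n):
            # Exact maximizer over mu[p] with the others fixed, clipped at 0.
            rest = sum(K[p][q] * mu[q] for q in range(n) if q != p)
            mu[p] = max(0.0, (gamma[p] - rest) / K[p][p])
    eta = [mu[p] - bh[p] for p in range(n)]
    return mu, eta
```

At the solution, the gradient of the dual objective is nonpositive in every coordinate, which by (61) is the same as the primal constraint of (59) holding for the recovered eta.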